AITopics | pre-training bert

FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models

Yuan, Bithiah

arXiv.org Artificial Intelligence, May-5-2025

Motivated by the emerging demand in the financial industry for the automatic analysis of unstructured and structured data at scale, Question Answering (QA) systems can provide lucrative and competitive advantages to companies by facilitating the decision making of financial advisers. Consequently, we propose a novel financial QA system using the transformer-based pre-trained BERT language model to address the limitations of data scarcity and language specificity in the financial domain. Our system focuses on financial non-factoid answer selection, which retrieves a set of passage-level texts and selects the most relevant as the answer. To increase efficiency, we formulate the answer selection task as a re-ranking problem, in which our system consists of an Answer Retriever using BM25, a simple information retrieval approach, to first return a list of candidate answers, and an Answer Re-ranker built with variants of pre-trained BERT language models to re-rank and select the most relevant answers. We investigate various learning, further pre-training, and fine-tuning approaches for BERT. Our experiments suggest that FinBERT-QA, a model built from applying the Transfer and Adapt further fine-tuning and pointwise learning approach, is the most effective, improving the state-of-the-art results of task 2 of the FiQA dataset by 16% on MRR, 17% on NDCG, and 21% on Precision@1.
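The retrieve-then-re-rank design described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `BM25` class is a small self-contained Okapi BM25 with assumed default parameters (`k1=1.5`, `b=0.75`), the passages and query are invented toy data, and `overlap_scorer` is a hypothetical placeholder standing in for a fine-tuned BERT model that scores each (question, answer) pair independently, which is what a pointwise approach does.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a stand-in for real preprocessing."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Okapi BM25 over a small in-memory list of passages."""

    def __init__(self, passages, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(p) for p in passages]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Non-negative IDF variant: log(1 + (n - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def overlap_scorer(query, passage):
    """Hypothetical placeholder for a BERT relevance model (Jaccard overlap)."""
    q, p = set(tokenize(query)), set(tokenize(passage))
    return len(q & p) / len(q | p)


def rerank(query, passages, candidate_ids, scorer):
    """Pointwise re-ranking: score each (query, passage) pair on its own."""
    return sorted(candidate_ids, key=lambda i: scorer(query, passages[i]), reverse=True)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is found."""
    for rank, i in enumerate(ranked_ids, start=1):
        if i in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy data, invented for illustration only.
passages = [
    "An index fund tracks a market index and usually has low fees.",
    "You can deduct mortgage interest on your tax return in some cases.",
    "Low fees matter because fund costs compound over decades.",
    "A credit score reflects how reliably you repay debt.",
]
query = "why do index fund fees matter"

candidates = BM25(passages).top_k(query, k=2)   # stage 1: cheap retrieval
reranked = rerank(query, passages, candidates, overlap_scorer)  # stage 2: re-rank
```

Averaging `reciprocal_rank` over all test questions gives MRR, one of the metrics the abstract reports. The efficiency argument is that the expensive scorer only sees the `k` candidates returned by BM25 rather than the whole collection.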

large language model, machine learning, question answering, (23 more...)

arXiv.org Artificial Intelligence

2505.00725

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)

Genre: Research Report > New Finding (0.93)

Industry:

Banking & Finance > Financial Services (0.48)
Health & Medicine > Consumer Health (0.46)
Government > Tax (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

FaBERT: Pre-training BERT on Persian Blogs

Masumi, Mostafa, Majd, Seyed Soroush, Shamsfard, Mehrnoush, Beigy, Hamid

arXiv.org Artificial Intelligence, Feb-9-2024

We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs corpus, encompassing both informal and formal Persian texts. FaBERT is designed to excel in traditional Natural Language Understanding (NLU) tasks, addressing the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language. In our comprehensive evaluation of FaBERT on 12 datasets in various downstream tasks, encompassing Sentiment Analysis (SA), Named Entity Recognition (NER), Natural Language Inference (NLI), Question Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated improved performance, all achieved within a compact model size. The findings highlight the importance of utilizing diverse and cleaned corpora, such as HmBlogs, to enhance the performance of language models like BERT in Persian Natural Language Processing (NLP) applications. FaBERT is openly accessible at https://huggingface.co/sbunlp/fabert

corpus, dataset, fabert, (13 more...)

arXiv.org Artificial Intelligence

2402.06617

Country: North America > United States (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

The Effects of In-domain Corpus Size on pre-training BERT

Sanchez, Chris, Zhang, Zheyuan

arXiv.org Artificial Intelligence, Dec-15-2022

... Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and its variants (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) has proven to be an excellent strategy and achieved state-of-the-art results on many downstream natural language processing (NLP) tasks. Most models focused their pre-training efforts on general domain text. For example, the original BERT model was trained on Wikipedia and the BookCorpus (Zhu et al., 2015). Many other following efforts focused on adding additional texts to the pre-training process to create even larger models with the intent of improving model performance (Liu et al., 2019; Raffel et al., 2019). However, recent works have ...

Web scraping is one oft-cited method used to gather publicly available documents to increase one's in-domain training corpora. For example, LEGAL-BERT (Chalkidis et al., 2020) authors scraped publicly available legal text from six different sources, to achieve a total corpus size of 12 GB. Nevertheless, this data collection process is laborious and time-consuming and could discourage researchers from conducting such experiments for fear of being unable to collect enough data. On the other hand, it would also be a waste of resources if, after all the data is collected, it turns out the data is still not enough for pre-training and the model ends up having poor performance.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2212.07914

Country: North America > United States > Virginia > Fairfax County > Reston (0.04)

Genre: Research Report > New Finding (0.69)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Pre-Training BERT on Arabic Tweets: Practical Considerations

Abdelali, Ahmed, Hassan, Sabit, Mubarak, Hamdy, Darwish, Kareem, Samih, Younes

arXiv.org Artificial Intelligence, Feb-21-2021

Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highlight the centrality of data diversity and the efficacy of linguistically aware segmentation. They also highlight that more data or more training steps do not necessarily yield better models. Our new models achieve new state-of-the-art results on several downstream tasks. The resulting models are released to the community under the name QARiB.

bert, proceedings, tweet, (15 more...)

arXiv.org Artificial Intelligence

2102.10684

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Germany > Berlin (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Pre-training BERT from scratch with cloud TPU

#artificialintelligence, Jun-14-2019, 15:56:52 GMT

In this experiment, we will pre-train BERT, a state-of-the-art Natural Language Understanding model, on arbitrary text data using Google Cloud infrastructure. This is useful if a pre-trained model for your language or use case is not available as open source. The guide is intended for NLP researchers who are excited about BERT but not satisfied with the performance of the available open-source models. For persistent storage of training data and the model, you will need a Google Cloud Storage bucket.
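As a rough illustration of what pre-training on arbitrary text involves, the sketch below prepares one masked language modeling example in the style of the original BERT recipe: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random vocabulary token, and 10% are left unchanged. It assumes the guide follows this standard recipe. The function name `mask_tokens`, the toy vocabulary, and the sample sentence are hypothetical and not taken from the guide.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # never selected for masking


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-LM training example.

    labels maps each selected position to its original token, which is
    what the model is trained to predict at that position.
    """
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    if not candidates:
        return list(tokens), {}
    n_mask = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    positions = sorted(rng.sample(candidates, n_mask))

    masked, labels = list(tokens), {}
    for i in positions:
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels


# Toy example, invented for illustration only.
vocab = ["the", "bank", "raised", "interest", "rates", "again", "loan"]
tokens = ["[CLS]", "the", "bank", "raised", "interest", "rates", "again", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab)
```

A real pipeline would run this over tokenized shards of the corpus and write the results to the storage bucket mentioned above. The fixed `seed` is only there to make the sketch reproducible.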

cloud computing, machine learning, pre-training bert, (7 more...)

#artificialintelligence

Industry: Information Technology > Services (0.86)

Technology:

Information Technology > Cloud Computing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
